No null values.

There is one duplicated record.

The duplicated record is now dropped from the dataset.
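The checks above can be sketched with pandas as follows. This is a minimal illustration on a hypothetical four-row frame (the values and the name `df` are assumptions, not the notebook's actual data or code); the last row deliberately repeats the first.

```python
import pandas as pd

# Hypothetical miniature of the dataset; the last row repeats the first.
df = pd.DataFrame(
    {
        "age": [63, 37, 41, 63],
        "sex": [1, 1, 0, 1],
        "chol": [233, 250, 204, 233],
        "target": [1, 1, 1, 1],
    }
)

# Count missing values per column (all zeros means no nulls).
null_counts = df.isnull().sum()

# Count rows that exactly repeat an earlier row.
n_duplicates = int(df.duplicated().sum())

# Keep the first occurrence of each row and renumber the index.
df_clean = df.drop_duplicates().reset_index(drop=True)

print(null_counts.to_dict())
print(n_duplicates, len(df), len(df_clean))
```

`drop_duplicates()` keeps the first occurrence by default, so only the repeated copy is removed.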

Exploratory Data Analysis (EDA)

Variable Description

- age : age of the patient in years

- sex : (1 = male; 0 = female)

- cp : chest pain type (0 = typical angina, 1 = atypical angina, 2 = non-anginal pain, 3 = asymptomatic)

- trestbps : resting blood pressure (in mm Hg on admission to the hospital)

- chol : serum cholesterol in mg/dl

- fbs : (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)

- restecg : resting electrocardiographic results (0 = normal, 1 = ST-T wave abnormality, 2 = Possible LV hypertrophy)

- thalach : maximum heart rate achieved

- exang : exercise induced angina (1 = yes; 0 = no)

- oldpeak : ST depression induced by exercise relative to rest

- slope : the slope of the peak exercise ST segment (0 = upsloping, 1 = flat, 2 = downsloping)

- ca : number of major vessels (0-3) colored by fluoroscopy

- thal : 3 = normal; 6 = fixed defect; 7 = reversible defect

- target : 0 = lower chance of heart attack , 1 = higher chance of heart attack

Univariate Analysis of Categorical Variables

More than half of the patient population is at high risk for CVD.

Percentage Chest Pain Type

Almost half of the patients in the dataset have typical angina, with non-anginal pain coming in second place.
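Percentages like these can be computed with `value_counts(normalize=True)`. The sketch below uses a hypothetical series of ten `cp` codes, so the resulting shares are illustrative only and do not reproduce the chart.

```python
import pandas as pd

# Hypothetical chest pain codes (0-3); not the real data.
cp = pd.Series([0, 0, 0, 0, 2, 2, 2, 1, 1, 3], name="cp")

# Code-to-label mapping taken from the variable description.
labels = {
    0: "typical angina",
    1: "atypical angina",
    2: "non-anginal pain",
    3: "asymptomatic",
}

# normalize=True returns proportions; multiply by 100 for percentages.
cp_pct = cp.map(labels).value_counts(normalize=True).mul(100).round(1)
print(cp_pct)
```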

Males make up the majority of the patient population in the dataset, at 68%.

This shows that the majority of the patient population does not have a fasting blood sugar > 120 mg/dl. From domain knowledge, fbs < 120 mg/dl is normal, whereas fbs > 120 mg/dl increases the risk of heart attack.

About half of the patients in the dataset have abnormal resting electrocardiographic results with ST-T wave abnormality, while nearly all of the rest have a normal resting electrocardiographic result. Only 1.3% of patients have abnormal restecg results due to possible left ventricular hypertrophy.

Here we can see that the majority of the patient population in the dataset does not suffer from exercise-induced angina.

Fluoroscopy helps physicians detect blood flow in arteries/vessels. A colored vessel is a vessel that has blood flowing through it. A vessel that doesn't color is a vessel where blood flow is impeded. More than half of the patient population have no major vessel colored by fluoroscopy, while around a quarter of the patients have 1 major vessel colored by fluoroscopy.

7% of patients have an upsloping exercise ST segment, which is normal. Nearly half of the patients have a flat exercise ST slope; a flat ST slope is normal during rest but is not normal during exercise, which is the case here. The remaining patients, also nearly half, have a downsloping exercise ST segment, which indicates exercise-induced ST depression.

I will assume that thalassemia in this dataset is beta thalassemia, which is the most prevalent type and the one most correlated with CVD. There are three types of beta thalassemia: major, intermedia and minor. The dataset does not explain what the values (0, 1, 2, 3) stand for. I will assume 0 = no thalassemia, 1 = thalassemia major, 2 = thalassemia intermedia, 3 = thalassemia minor. The chart shows that more than half of the patients suffer from thalassemia intermedia, 38.7% suffer from thalassemia minor, and only 6% suffer from thalassemia major, which is the most severe form of the disease.

Bivariate Analysis

Let's study the occurrence of CVD across different ages.

This treemap shows that the occurrence of CVD is independent of age. An increase in age doesn't mean an increased risk of CVD. The highest occurrence of CVD seems to be around the ages of 41 and 42. Ages between 51 and 53 come in second place in terms of an increased risk of CVD.

There is no correlation between resting blood pressure and CVD risk. A patient with a higher resting blood pressure doesn't necessarily have an increased risk of a heart attack. The chance of a heart attack seems to be highest at a resting systolic blood pressure of around 120 and 140. Notice, however, that the low-risk cases are also concentrated at the same resting blood pressure values (120 and 140).

N.B.: The counts for both cases (high and low risk) at a resting blood pressure of around 120-140 mm Hg are very close.

Let's explore all the variables in the dataset.

This shows that women are at a higher risk of heart attack in this dataset, whereas men are at a lower risk of heart attack.
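Because males make up most of the dataset, raw counts can mislead; comparing the share of `target = 1` within each sex is more informative. A minimal sketch on hypothetical rows (the numbers are made up for illustration and are not the notebook's results):

```python
import pandas as pd

# Hypothetical rows: sex (1 = male, 0 = female) and target (1 = higher risk).
df = pd.DataFrame(
    {
        "sex": [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
        "target": [1, 1, 1, 0, 1, 1, 0, 0, 0, 1],
    }
)

# Since target is coded 0/1, its mean per group is the share of high-risk patients.
rate_by_sex = df.groupby("sex")["target"].mean()

# Raw counts per (sex, target) pair, useful for checking group sizes.
counts = pd.crosstab(df["sex"], df["target"])

print(rate_by_sex)
print(counts)
```

In this toy frame both sexes have three high-risk patients, yet the female rate is higher because the female group is smaller.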

Age showed no strong association with CVD risk, whereas sex did (female patients are at a higher risk). I decided to continue my EDA by exploring the other variables with respect to sex.

The oldest patients in the dataset are female.

The vast majority of males have typical angina as their chest pain type, with a low risk of heart attack. For females, typical angina and non-anginal pain occur with almost the same frequency. Female patients with non-anginal pain are at a high risk of heart attack. Thus, for males, atypical angina is correlated with an increased risk of heart attack; for females, non-anginal pain and atypical angina are correlated with the highest risk of heart attack.

For males, fbs > 120 mg/dl puts the patient at a slightly higher risk of CVD compared to fbs < 120 mg/dl, which matches domain knowledge. For females the picture is different: the majority of female patients who are at a very high risk of CVD presented with fbs < 120 mg/dl.

Both charts show that an increase in total cholesterol is not correlated with an increased risk of CVD. Among both male and female patients, those with higher levels of total cholesterol (> 200 mg/dl) show a lower risk of CVD.

N.B. Normal TC level for both males and females >20 yrs is 125-200 mg/dl.

Possible LV hypertrophy is only present in female patients in the dataset.

Males with ST-T wave abnormality presented with a higher CVD risk than those with normal results. Females with ST-T wave abnormality presented with the highest CVD risk.

It's worth noting that female patients with NORMAL resting electrocardiographic results presented with a higher CVD risk than male patients with ABNORMAL ST-T wave results.

For both male and female patients, exercise-induced angina was not associated with an increase in CVD risk. Male and female patients without exercise-induced angina presented with a higher risk of CVD, especially females.

As we go deeper into our analysis, it's becoming evident that a variable associated with an increased CVD risk in females can have the opposite association in males.

Both males and females with a downward ST slope show an increased risk of CVD, but females with a downward ST slope show a much higher risk than their male counterparts.

Fluoroscopy detects blood flow in arteries and vessels. When a vessel isn't colored, this indicates impeded blood flow.

The majority of male patients had no vessel colored out of the 4 vessels checked. This group has an increased risk of CVD. Strangely enough, very few male patients (count = 5) had all 4 vessels colored, showing blood flow, yet they presented with the highest risk of CVD of all male groups.

The majority of female patients had no vessel colored as well, but they presented with a higher risk of CVD than the corresponding male group. Female patients with 1 colored vessel presented with an increased risk of CVD as well.

This shows that the number of vessels that don't have normal blood flow is associated with an increased CVD risk.

Both male and female patients with beta thalassemia intermedia presented with the highest risk of CVD compared to patients with thalassemia minor and major. Within the beta thalassemia intermedia patients, females presented with a higher risk of CVD than males.

From the profiler, the skewness of chol is 1.14, which is greater than 1. This means that the distribution of total cholesterol is highly skewed to the right, with a long tail of high values.
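For reference, a skewness figure of this kind can be obtained directly with pandas' `Series.skew()`. The values below are hypothetical and chosen only to show a right-skewed sample; they do not reproduce the 1.14 reported by the profiler, and the profiler may use a slightly different estimator.

```python
import pandas as pd

# Hypothetical cholesterol readings (mg/dl) with two high outliers.
chol = pd.Series([180, 190, 200, 210, 220, 230, 240, 250, 400, 560])

# Sample skewness; a positive value means a longer right tail.
skewness = chol.skew()

# In a right-skewed sample the mean is pulled above the median.
print(round(skewness, 2), chol.mean(), chol.median())
```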

The highest density of high-risk CVD cases occurs at chol levels of around 200 and 300 mg/dl. The low-risk CVD cases are also concentrated where total chol levels are around 200 and 300 mg/dl, but with a much lower density.

Patients with a downward ST slope during peak exercise presented with an increased risk of heart attack, especially female patients (see the treemap above).

Patients with a higher maximum heart rate achieved presented with a higher risk of heart attack.

Thalach is maximum heart rate achieved.

Old peak is the ST depression induced by exercise relative to rest.

Patients with a lower peak of ST depression presented with a higher risk of CVD.

Above oldpeak = 2, cases of high CVD risk decreased.

High-risk CVD cases are concentrated between a resting bp of 120 and 140. Low-risk CVD cases are ALSO concentrated between a resting bp of 120 and 140. Thus, trestbps is an irrelevant variable in predicting CVD risk.

Let's plot a scatterplot heat matrix to understand the relationships among all the variables in the dataset.

From the above EDA, I can make the following conclusions:

Variation in age, sex, cp, restecg, ca, exang, slope, thal, fbs, chol, oldpeak and thalach is correlated with a lower/higher risk of heart attack.

In contrast, trestbps appeared irrelevant to predicting CVD risk. A resting blood pressure of 120-140 mm Hg showed the highest concentrations for both low and high CVD risk cases, with very close densities for both.

Model Selection

Since the target variable in this task is binary categorical, I will choose logistic regression (a supervised classification algorithm) to help estimate the impact of each variable on the odds of the observed event of interest, which is risk of CVD.
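To make the link between coefficients and odds concrete: logistic regression models the log-odds as a linear function of the features, so raising one feature by one unit multiplies the odds by the exponential of its coefficient (the odds ratio). The sketch below checks this numerically; the intercept, coefficients and feature values are hypothetical and are not taken from the fitted model.

```python
import math


def predicted_probability(intercept, coefs, x):
    """Logistic model: sigmoid of the linear predictor."""
    z = intercept + sum(b * v for b, v in zip(coefs, x))
    return 1.0 / (1.0 + math.exp(-z))


def odds(p):
    """Convert a probability into odds."""
    return p / (1.0 - p)


# Hypothetical intercept and coefficients for two features.
intercept = -1.0
coefs = [0.8, -0.5]

# Same patient, with the first feature increased by one unit.
p_base = predicted_probability(intercept, coefs, [1.0, 2.0])
p_plus = predicted_probability(intercept, coefs, [2.0, 2.0])

# The ratio of the two odds equals exp(coefficient of the first feature).
ratio = odds(p_plus) / odds(p_base)
print(ratio, math.exp(coefs[0]))
```

An odds ratio above 1 means the feature raises the odds of the event; below 1 means it lowers them.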

Feature encoding

Feature Selection

Splitting into train and test

Model Fitting

Instantiate the model using the default parameters.

I fitted the model on training data. I will predict the labels of test data.
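The steps from feature encoding through prediction might look roughly like the sketch below, using scikit-learn. Everything about the data is an assumption: the frame is synthetic, only three columns are used, and the row count (302) and 20% test fraction were picked so that the test set has 61 rows, the size implied by the 55 correct and 6 incorrect predictions reported in this notebook. The actual encoding, feature selection and split used here are not shown in the text.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned dataset (hypothetical values).
rng = np.random.default_rng(0)
n = 302  # assumed row count, see the note above
df = pd.DataFrame(
    {
        "cp": rng.integers(0, 4, size=n),
        "oldpeak": rng.uniform(0, 4, size=n).round(1),
        "sex": rng.integers(0, 2, size=n),
    }
)
# Synthetic label loosely tied to oldpeak so the model has some signal.
df["target"] = (df["oldpeak"] + rng.normal(0, 1, size=n) < 2).astype(int)

# Feature encoding: one-hot encode the multi-class categorical column.
# drop_first avoids a redundant dummy column.
X = pd.get_dummies(
    df.drop(columns="target"), columns=["cp"], drop_first=True, dtype=int
)
y = df["target"]

# Splitting into train and test (assumed 80/20 split).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Model fitting with default parameters, then prediction on the test set.
model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

accuracy = float((y_pred == y_test).mean())
print(len(y_test), round(accuracy, 4))
```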

The prediction is over. Now I will evaluate the performance of the model.

This means we got 55 correct predictions and 6 incorrect predictions with this model.

The accuracy of a model is the proportion of predictions that the model classified correctly. For our model it is 90.16% (55 of 61), so about 90% of predictions are classified correctly.

The F1 score is the harmonic mean of precision and recall. It is a measure of the preciseness and robustness of the model. The closer it is to 1, the better the model is at predicting the target variable.

Recall is the proportion of actual positives that were identified correctly by the model.

Precision is the proportion of positive identifications that were actually correct.
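The metric definitions above can be worked through by hand. Accuracy follows directly from the 55 correct and 6 incorrect predictions. The breakdown into true/false positives and negatives is not given in the text, so the split used below for precision, recall and F1 is purely hypothetical and only illustrates the formulas.

```python
# Accuracy from the reported counts.
correct, incorrect = 55, 6
accuracy = correct / (correct + incorrect)

# Hypothetical confusion-matrix cells consistent with 55 correct / 6 incorrect.
tp, tn, fp, fn = 30, 25, 4, 2

# Precision: share of predicted positives that are truly positive.
precision = tp / (tp + fp)

# Recall: share of actual positives that the model found.
recall = tp / (tp + fn)

# F1: harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy  = {accuracy:.2%}")  # accuracy  = 90.16%
print(f"precision = {precision:.4f}")
print(f"recall    = {recall:.4f}")
print(f"f1        = {f1:.4f}")
```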

In conclusion, the logistic regression model performed well at predicting risk of CVD, with about 90% accuracy on the test data.